Why can't I reproduce benchmark scores from papers like Phi, Llama, or Qwen? Am I doing something wrong or is this normal?

I’m working on evaluating open-source LLMs (e.g., Phi, Llama, Qwen), and I’ve noticed that the benchmark scores I get are consistently different from the ones reported in their tech reports or papers — sometimes by a wide margin.

Sometimes the results are lower than expected, but surprisingly, sometimes they’re higher too. The point is that in many cases the gap is quite large, and it’s not clear why.

I’ve tried:

  • Using lm-eval-harness with its default settings (roughly as in the sketch below)
  • Matching tokenizers and prompt formats as closely as possible
  • Evaluating on standard benchmarks like MMLU, GSM8K, and ARC under the same few-shot conditions described in the reports
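For reference, this is roughly how I’m running things. It’s a minimal sketch assuming lm-evaluation-harness v0.4.x’s Python API; the model ID is a placeholder and exact task names or arguments may differ between versions:

```python
# Minimal sketch of my setup, assuming lm-evaluation-harness v0.4.x.
# The model ID and task names are placeholders; adjust for your versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                  # Hugging Face backend
    model_args="pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16",
    tasks=["mmlu", "gsm8k", "arc_challenge"],
    num_fewshot=5,                               # match the paper's few-shot count
    batch_size=8,
)

# Per-task metrics end up under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```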

Despite this, the scores I get are often significantly different from what’s published — and I can’t find any official scripts or clear explanations of the exact benchmarking setup used in those papers.

This seems to happen not just with one model, but across many open-source models.

Is this a common experience in the community?

  • Are papers using special prompt engineering or internal eval setups they don’t release?
  • Am I missing some key benchmarking tricks?
  • Is this just part of the game at this point?

Would really appreciate if anyone can share:

  • Experience trying to reproduce scores
  • Any evaluation tips
  • Benchmarking setups that actually match reported numbers

Thanks in advance!


Results can vary with the backend, or more precisely with the library version and the options passed to the generation function (such as temperature), so published scores can really only be used as a rough guide. Leaderboards are easy to compare because every model is evaluated under the same criteria within that leaderboard, but there aren’t many absolute indicators that hold across setups. For models from large companies, the output of the endpoints they officially provide can at least serve as a reference.
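To illustrate the point about generation options: something as small as the sampling configuration can change the completion, and therefore the extracted answer, on generative benchmarks like GSM8K. A minimal sketch with transformers, where the model ID and prompt are placeholders and the contrast between greedy and sampled decoding is the point:

```python
# Illustration of how decoding settings alone can change benchmark outputs.
# Model ID and prompt are placeholders; the greedy-vs-sampled contrast is the point.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Q: A train travels 60 miles per hour for 2.5 hours. How far does it go?\nA:"
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding: deterministic, what many harnesses use by default.
greedy = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Sampling at temperature 0.7: can produce a different reasoning chain,
# and therefore a different final answer, on some questions.
sampled = model.generate(
    **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.9
)

print(tok.decode(greedy[0], skip_special_tokens=True))
print(tok.decode(sampled[0], skip_special_tokens=True))
```

The same applies to answer-extraction regexes and prompt templates: unless the paper releases its exact harness configuration, small differences like these can add up to several points either way.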